18 research outputs found

    A methodology for efficient code optimizations and memory management

    The key to optimizing software is the correct choice, ordering, and parameterization of optimizing transformations, a problem that has remained open in compilation research for decades, for several reasons. First, most of the compilation subproblems/transformations are interdependent, so addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size and associativity) and to the algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do so only partly. Third, the search space (all combinations of transformation parameters) is very large, so exhaustive searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that takes as input the underlying architecture details and the algorithm characteristics, and outputs near-optimum parameters for six code optimizations targeting L1 accesses, L2 accesses, DDR accesses, execution time, or energy consumption. The proposed methodology has been evaluated on both embedded and general-purpose processors, on six well-known algorithms, achieving high speedup and energy-gain values over the gcc compiler, hand-written optimized code, and Polly.
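    The coupling the abstract describes can be made concrete with loop tiling, one representative transformation whose parameter depends directly on cache geometry. The sketch below is our minimal illustration, not the paper's algorithm; the L1 size, associativity, and line size are assumed values.

```c
/* Illustrative sketch: deriving a loop-tiling factor from cache
 * geometry, the kind of architecture-coupled parameter selection the
 * methodology automates. All cache parameters here are assumptions. */
#include <stdio.h>

#define L1_SIZE  32768  /* bytes (assumed)  */
#define L1_ASSOC 8      /* ways (assumed)   */
#define LINE     64     /* bytes per line   */

int main(void) {
    /* For a kernel streaming two float arrays, size the tile so both
       tiles fit in L1 while leaving one way free to absorb conflicts. */
    int usable = L1_SIZE * (L1_ASSOC - 1) / L1_ASSOC;
    int tile = usable / (2 * (int)sizeof(float));
    tile -= tile % (LINE / (int)sizeof(float));  /* whole cache lines */
    printf("tile = %d elements\n", tile);
    return 0;
}
```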

    Combining software cache partitioning and loop tiling for effective shared cache management

    One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods, for two reasons. First, cache partitioning and loop tiling are strongly coupled, so addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the details of the target shared cache architecture and the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first time that a methodology provides i) a theoretical foundation for the above-mentioned cache management mechanisms and ii) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach lowers the number of main memory accesses by an order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested. To this end, we present a search space exploration analysis in which our proposal offers a vast reduction of the required search space.
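    To see the coupling in code: a tile size that is valid for the whole shared cache overflows the partition actually granted to the task. The sketch below is ours, not the paper's framework; the partition size, matrix size, and kernel are assumptions.

```c
/* Illustrative sketch: loop tiling sized against an assumed shared
 * cache partition rather than the full cache. Not the paper's code. */
#include <string.h>

#define N 512
#define PARTITION_BYTES (256 * 1024)  /* assumed per-task partition */

/* Pick T so roughly three T x T double tiles fit the partition:
   3 * 64 * 64 * 8 bytes = 96 KiB < 256 KiB. */
enum { T = 64 };

void mm_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```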

    Cache partitioning + loop tiling: A methodology for effective shared cache management

    In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework for fine-tuning these two mechanisms in tandem (not separately). Our approach lowers the number of main memory accesses by one order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We also present a search space exploration analysis in which our proposal offers a vast reduction of the required search space.
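    The dependence between the two mechanisms can be summarized by a capacity constraint of the following form (our notation, not taken from the paper):

```latex
% Illustrative constraint (our notation): a task granted W of the A ways
% of a shared cache of size C must pick tile sizes T_1..T_m whose
% combined working set fits its partition.
\[
\sum_{k=1}^{m} T_k \, s_k \;\le\; \frac{W}{A}\, C, \qquad 1 \le W \le A,
\]
% where s_k is the element size of array k. Shrinking the partition
% (smaller W) forces smaller tiles and more loop overhead, so the two
% parameters must be chosen jointly rather than one after the other.
```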

    A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

    In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware parameters (e.g., data cache sizes and associativities) as one problem and not separately, giving high-quality solutions and a smaller search space.
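    As a flavor of the kind of kernel involved, here is a minimal SSE micro-kernel of our own (a sketch under assumed row-major layout and n divisible by 4, not the paper's generated code):

```c
/* Illustrative SSE micro-kernel: broadcasts one element of A across a
 * SIMD register and multiplies it against four consecutive elements of
 * B, cutting load and arithmetic instruction counts. Sketch only. */
#include <xmmintrin.h>

void mmm_simd(int n, const float *A, const float *B, float *C) {
    /* row-major n x n matrices; n assumed to be a multiple of 4 */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j += 4) {
            __m128 c = _mm_setzero_ps();
            for (int k = 0; k < n; k++) {
                __m128 a = _mm_set1_ps(A[i * n + k]);   /* broadcast */
                __m128 b = _mm_loadu_ps(&B[k * n + j]); /* 4 floats  */
                c = _mm_add_ps(c, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i * n + j], c);
        }
}
```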

    Efficient FPGA implementations of Volterra DFEs for optical systems

    In this work, suitable architectures and high-throughput FPGA implementations of Volterra Decision Feedback Equalizers (VDFEs) for optical communication links are presented. Two VDFE configurations were selected based on the available resources of the employed FPGA devices, and two multiplexer-based architectures were developed for each of them in order to achieve the target throughput. The comparison of the experimental results across different VDFE configurations, throughputs, and FPGA devices points out the platform-specific design characteristics. The introduced architectures meet the desired 10 Gb/s throughput, demonstrating that the FPGA is a suitable platform for high-speed optical fiber communication systems.
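    For readers unfamiliar with the equalizer itself, one VDFE step is sketched below as a behavioral model (our illustration; tap counts and coefficient names are assumptions, and the paper's contribution is the pipelined, multiplexer-based hardware mapping of this recurrence, not the recurrence itself):

```c
/* Behavioral sketch of one second-order Volterra DFE step: linear
 * feedforward terms, second-order Volterra cross-terms, decision
 * feedback, and a binary slicer. Illustrative only. */
#define FF 3  /* feedforward taps (assumed) */
#define FB 2  /* feedback taps (assumed)    */

static double h1[FF];      /* linear kernel         */
static double h2[FF][FF];  /* second-order kernel   */
static double b[FB];       /* feedback coefficients */

/* x: last FF received samples (x[0] newest); d: last FB decisions. */
double vdfe_step(const double *x, double *d) {
    double y = 0.0;
    for (int i = 0; i < FF; i++) {
        y += h1[i] * x[i];
        for (int j = i; j < FF; j++)      /* symmetric kernel */
            y += h2[i][j] * x[i] * x[j];
    }
    for (int k = 0; k < FB; k++)
        y -= b[k] * d[k];                 /* decision feedback */
    double dec = (y >= 0.0) ? 1.0 : -1.0; /* slicer */
    for (int k = FB - 1; k > 0; k--)      /* shift decision history */
        d[k] = d[k - 1];
    d[0] = dec;
    return dec;
}
```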

    Array size computation under uniform overlapping and irregular accesses

    The size required to store an array is crucial for an embedded system, as it affects the memory size, the energy per memory access, and the overall system cost. Existing techniques for finding the minimum number of resources required to store an array are less efficient for codes with large loops and irregularly occurring memory accesses. They either approximate the accessed parts of the array, leading to overestimation of the required resources, or their exploration time grows with the number of distinct accessed parts of the array. We propose a methodology to compute the minimum resources required for storing an array that keeps the exploration time low and provides a near-optimal result for both regularly and irregularly occurring memory accesses and for overlapping writes and reads.
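    A toy instance makes the problem concrete. For a kernel in which element A[i] is written at iteration i and last read at iteration i + 3, at most four elements are ever live at once, so a four-slot buffer suffices. The sketch below (ours, not the paper's method) finds that bound by scanning the liveness windows:

```c
/* Illustrative minimum-buffer computation for a toy access pattern:
 * A[i] is written at iteration i and last read at iteration i + DIST,
 * so the answer is the peak number of simultaneously live elements. */
#include <stdio.h>

#define N    100
#define DIST 3   /* assumed reuse distance of the toy kernel */

int main(void) {
    int max_live = 0;
    for (int t = 0; t < N + DIST; t++) {
        int lo = (t - DIST > 0) ? t - DIST : 0;  /* oldest live write */
        int hi = (t < N - 1) ? t : N - 1;        /* newest write yet  */
        int live = hi - lo + 1;
        if (live > max_live) max_live = live;
    }
    printf("minimum buffer: %d elements\n", max_live);  /* prints 4 */
    return 0;
}
```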

    FPGA implementation of a MIMO DFE in 40 Gb/s DQPSK optical links

    In this paper, an FPGA implementation of a Multiple Input Multiple Output (MIMO) Decision Feedback Equalizer (DFE) is proposed for the electronic compensation of the impairments in 40 Gb/s Intensity Modulated Direct Detection (IM/DD) optical communication links employing NRZ DQPSK signaling. The proposed equalizer is used for the electronic compensation of the residual Chromatic Dispersion (CD) along installed, optically compensated optical paths. The required processing rate is achieved by applying intensive pipelining and parallelism to the original architecture of the equalizer. At the given processing rate, an 8-input, 2-output DFE with three-tap feedforward filtering and two-tap feedback filtering is implemented on a single cutting-edge Xilinx FPGA device.
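    The feedback loop is what makes a DFE hard to run at 40 Gb/s: each decision depends on the previous ones, so the loop cannot be pipelined directly. A standard unrolling trick, of the kind implied by "intensive pipelining and parallelism," precomputes the output for every possible previous decision and selects the right one with a multiplexer; a one-feedback-tap binary sketch of our own:

```c
/* Speculative (look-ahead) DFE step, one feedback tap, binary
 * decisions: both candidate outputs are computed unconditionally (in
 * hardware, in parallel) and the previous decision drives a mux.
 * Illustrative sketch, not the paper's architecture. */
int dfe_step_unrolled(double ff_out, double b1, int prev_dec) {
    double y_if_pos = ff_out - b1 * (+1.0); /* assume prev was +1 */
    double y_if_neg = ff_out - b1 * (-1.0); /* assume prev was -1 */
    double y = (prev_dec > 0) ? y_if_pos : y_if_neg; /* multiplexer */
    return (y >= 0.0) ? +1 : -1;            /* slicer */
}
```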

    A template-based methodology for efficient microprocessor and FPGA accelerator co-design

    Embedded applications usually require Software/Hardware (SW/HW) designs to meet hard timing constraints and the required design flexibility. Exhaustive exploration of SW/HW designs is a very time-consuming task, while ad hoc approaches and the use of partially automatic tools usually lead to less efficient designs. To support a more efficient co-design process for FPGA platforms, we propose a systematic methodology to map an application to a SW/HW platform with a custom HW accelerator and a microprocessor core. The mapping steps of the methodology are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management, and the Data Path (DP) Mapping. Several performance-area trade-off design Pareto points are produced by instantiating the templates. A real-time bioimaging application is mapped onto an FPGA to evaluate the gains of our approach, i.e., 44.8% on performance compared with pure SW designs and 58% on area compared with pure HW designs.
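    Because instantiating the templates yields many candidate designs, the final step amounts to keeping only the non-dominated (performance, area) points. A small Pareto filter of our own illustrates that selection (the struct and metric names are assumptions):

```c
/* Illustrative Pareto filter over (latency, area) design points: a
 * design survives unless another is no worse in both metrics and
 * strictly better in one. O(n^2), fine for small candidate sets. */
typedef struct { double latency, area; } Design;

int pareto(const Design *d, int n, int *keep) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            if (j != i &&
                d[j].latency <= d[i].latency &&
                d[j].area    <= d[i].area &&
                (d[j].latency < d[i].latency || d[j].area < d[i].area))
                dominated = 1;
        if (!dominated)
            keep[m++] = i;   /* index of a Pareto-optimal design */
    }
    return m;                /* number of Pareto points */
}
```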

    Illegal logging detection based on acoustic surveillance of forest

    In this article, we present a framework for the automatic detection of logging activity in forests using audio recordings. The framework was evaluated in terms of logging detection classification performance, and various widely used classification methods and algorithms were tested. Experimental setups using different sound-to-noise ratios were followed, and the best classification accuracy was reported by the support vector machine algorithm. In addition, a decision-level postprocessing scheme was applied that improved performance by more than 1%, mainly at low sound-to-noise ratios. Finally, we evaluated a late-stage fusion method combining the postprocessed recognition results of the three top-performing classifiers; the experimental results showed a further absolute improvement of approximately 2%, with logging-sound recognition accuracy reaching 94.42% at a sound-to-noise ratio of 20 dB.
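    The abstract does not spell out the fusion rule, but decision-level fusion of three classifiers is commonly a majority vote over per-frame decisions, optionally smoothed; the sketch below is our illustration of that generic scheme, not the paper's exact method:

```c
/* Illustrative decision-level (late) fusion and postprocessing for
 * per-frame logging decisions (1 = logging, 0 = background noise). */

/* Majority vote over three classifiers' decisions for one frame. */
int fuse3(int dec_a, int dec_b, int dec_c) {
    return (dec_a + dec_b + dec_c) >= 2;
}

/* Decision-level smoother: 3-frame majority vote that suppresses
 * isolated single-frame flips. */
void smooth(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        int a = in[i > 0 ? i - 1 : 0];
        int c = in[i < n - 1 ? i + 1 : n - 1];
        out[i] = (a + in[i] + c) >= 2;
    }
}
```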

    Area-throughput trade-offs for SHA-1 and SHA-256 hash functions’ pipelined designs

    High-throughput designs of hash functions are strongly demanded due to the need for security in every transmitted packet of worldwide e-transactions. Thus, optimized and non-optimized pipelined architectures have been proposed, raising, however, important questions. What is the optimum number of pipeline stages? Is it worth developing optimized designs, or could the same results be achieved simply by increasing the number of pipeline stages of the non-optimized designs? The paper answers the above questions by studying extensively many pipelined architectures of the SHA-1 and SHA-256 hashes, implemented on FPGAs, in terms of the throughput/area (T/A) factor. Guidelines for developing efficient security-scheme designs are also provided.
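    The trade-off under study can be stated compactly (our formulation, not the paper's equations; r denotes the round count, 80 for SHA-1 and 64 for SHA-256):

```latex
% Illustrative model: an N-stage pipeline over r rounds accepts a new
% 512-bit block every r/N cycles at clock frequency f(N), so
\[
\mathrm{Throughput}(N) = \frac{512 \cdot f(N)}{r / N},
\qquad
\frac{T}{A}(N) = \frac{\mathrm{Throughput}(N)}{\mathrm{Area}(N)}.
\]
% Adding stages raises N (and usually f) but also the area, so the T/A
% factor peaks at some optimum stage count; locating that optimum for
% optimized and non-optimized designs is what the paper explores.
```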